IN PROGRESS general dev contributions. Currently tracking: #17
Conversation
Walkthrough

The changes integrate new stages and refine configuration pipelines for preprocessing and extraction scripts. They introduce metadata collection, data normalization, and tokenization processes, restructured configurations, and filtering mechanisms for patients, codes, and outliers. The new functionalities enhance the data processing pipeline with better handling of code metadata and preparatory tasks for subsequent data analysis.
…s will also facilitate joining with ontologies and downstream usage (and it tests that, along with more general pre-processing-oriented utility).
Added a code metadata extraction step for core MEDS extraction.
Actionable comments posted: 4
Outside diff range and nitpick comments (3)
tests/test_extraction.py (3)
437-437: Add more comprehensive testing coverage. The warning message indicates that only the 'train/0' split has been checked. It would be beneficial to extend this coverage to other splits as well to ensure thorough testing of all scenarios; a sketch of one way to do so follows.
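A minimal sketch of how the existing check could be generalized over all splits. The split names and the helper are hypothetical and should be adapted to the fixtures actually defined in this test file.

```python
from pathlib import Path

import polars as pl

# Hypothetical split list; mirror whatever splits the test fixtures define.
SPLITS = ["train/0", "train/1", "tuning/0", "held_out/0"]


def check_all_splits(meds_dir: Path) -> None:
    for split in SPLITS:
        fp = meds_dir / f"{split}.parquet"
        assert fp.is_file(), f"Missing output shard for split {split}!"
        df = pl.read_parquet(fp)
        assert df.height > 0, f"Output shard for split {split} is empty!"
```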
454-454: Consider verifying the contents of code metadata. The logger warning indicates that the contents of the code metadata were not checked. It might be beneficial to add a test or check to ensure the accuracy and integrity of the metadata contents.
Line range hint 330-330: Use `key in dict` instead of `key in dict.keys()`. This is a minor performance enhancement. Using `key in dict` is more Pythonic and slightly more efficient than using `key in dict.keys()`.

- if key in dict.keys():
+ if key in dict:
Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Files selected for processing (11)
- configs/extraction.yaml (4 hunks)
- configs/preprocess.yaml (1 hunks)
- scripts/extraction/collect_code_metadata.py (1 hunks)
- scripts/preprocessing/add_time_derived_measurements.py (3 hunks)
- scripts/preprocessing/collect_code_metadata.py (3 hunks)
- scripts/preprocessing/filter_codes.py (1 hunks)
- scripts/preprocessing/filter_outliers.py (1 hunks)
- scripts/preprocessing/filter_patients.py (3 hunks)
- src/MEDS_polars_functions/code_metadata.py (9 hunks)
- src/MEDS_polars_functions/filter_measurements.py (1 hunks)
- tests/test_extraction.py (1 hunks)
Files not reviewed due to errors (3)
- scripts/preprocessing/filter_codes.py (no review received)
- scripts/preprocessing/filter_outliers.py (no review received)
- scripts/preprocessing/add_time_derived_measurements.py (no review received)
Files skipped from review due to trivial changes (1)
- scripts/preprocessing/filter_patients.py
Additional context used
Ruff
src/MEDS_polars_functions/filter_measurements.py
8-8: Module level import not at top of file (E402)
119-123: f-string used as docstring. Python will interpret this as a joined string, rather than a docstring. (B021)
207-211: f-string used as docstring. Python will interpret this as a joined string, rather than a docstring. (B021)
tests/test_extraction.py
7-7: Module level import not at top of file (E402)
8-8: Module level import not at top of file (E402)
9-9: Module level import not at top of file (E402)
10-10: Module level import not at top of file (E402)
11-11: Module level import not at top of file (E402)
13-13: Module level import not at top of file (E402)
14-14: Module level import not at top of file (E402)
15-15: Module level import not at top of file (E402)
330-330: Use `key in dict` instead of `key in dict.keys()` (SIM118). Remove `.keys()`.

Additional comments not posted (23)
configs/preprocess.yaml (4)
28-28: Configuration for `age` and `time_of_day` enhances pipeline flexibility and specificity.

36-40: Renaming and addition of `do_summarize_over_all_codes` align well with enhanced data aggregation needs.

46-52: Enhanced outlier detection configurations contribute to the robustness and accuracy of the data preprocessing pipeline.

55-55: Updated aggregations in `fit_normalization` are consistent with the overall improvements in data processing.

scripts/preprocessing/collect_code_metadata.py (6)
31-32: Check for consistent handling of paths and directory structures. Also applies to: 34-34.

Verification successful. The search results indicate that the `Path` object from the `pathlib` module is used consistently across various files to handle paths and directory structures. The handling of paths appears to be uniform and follows a standard approach.

- Consistent usage: the `Path` object is imported and used consistently across multiple files to define input and output directories.
- Directory structures: the directory structures are handled uniformly with the `Path` object, ensuring consistency in path manipulations.

No inconsistencies were found in the handling of paths and directory structures.

Scripts executed for the analysis:

Script:
#!/bin/bash
# Description: Ensure consistency in path handling.
rg --type python --glob '*.py' 'Path'
Length of output: 68

Script:
#!/bin/bash
# Description: Ensure consistency in path handling by searching for 'Path' usage in Python files.
rg 'Path' --glob '*.py'
Length of output: 7182
19-19: Ensure the correct configuration file is referenced.

Verification successful. The configuration path appears to be consistently referenced across multiple scripts. Specifically, the `config_path` is set to `"../../configs"` for various preprocessing and extraction scripts. This indicates that the configuration path is correctly updated and referenced.

scripts/preprocessing/collect_code_metadata.py: @hydra.main(version_base=None, config_path="../../configs", config_name="preprocess")

Other scripts also follow the same pattern, ensuring consistency.

Scripts executed for the analysis:

Script:
#!/bin/bash
# Description: Verify the configuration path is correctly updated across the project.
rg --type python --glob '*.py' '@hydra.main'
Length of output: 75

Script:
#!/bin/bash
# Description: Verify the configuration path is correctly updated across the project.
rg --glob '*.py' '@hydra.main'
Length of output: 1539
79-79: Confirm that the output file path and name are correctly handled.

Verification successful. The output from the shell script confirms that the file path and name `code_metadata.parquet` are used consistently across multiple files in the codebase. The relevant occurrences are:

- verify_parquet.py
- tests/test_extraction.py
- scripts/preprocessing/filter_outliers.py
- scripts/preprocessing/collect_code_metadata.py
- scripts/preprocessing/filter_codes.py
- scripts/extraction/collect_code_metadata.py

This indicates that the output file path and name are correctly handled.

Scripts executed for the analysis:

Script:
#!/bin/bash
# Description: Verify the output file path and name are correct.
rg --type python --glob '*.py' 'code_metadata.parquet'
Length of output: 85

Script:
#!/bin/bash
# Description: Verify the output file path and name are correct.
rg --type py 'code_metadata.parquet'
Length of output: 699
39-39: Verify the correct configuration parameters are passed to `mapper_fntr`.

Verification successful. The search results show that `mapper_fntr` is consistently called with two parameters, typically `stage_cfg` and a second parameter that varies (e.g., `None`, `ListConfig(code_modifier_columns)`, `code_modifier_columns`). This suggests that the correct configuration parameters are being passed as per different contexts.

- src/MEDS_polars_functions/code_metadata.py
- scripts/extraction/collect_code_metadata.py
- scripts/preprocessing/collect_code_metadata.py

Scripts executed for the analysis:

Script:
#!/bin/bash
# Description: Check if `mapper_fntr` is used consistently with the same parameters across different scripts.
rg --type python --glob '*.py' 'mapper_fntr'
Length of output: 75

Script:
#!/bin/bash
# Description: Check if `mapper_fntr` is used consistently with the same parameters across different scripts.
rg --type py 'mapper_fntr'
Length of output: 1344
47-47: Ensure output paths are handled consistently and correctly.

Verification successful. The output paths are handled consistently across the codebase with appropriate directory variables (`output_dir`, `metadata_input_dir`, and `MEDS_cohort_dir`).

- verify_parquet.py: direct assignment to `file_path`.
- tests/test_extraction.py: assignment to `output_file` using `MEDS_cohort_dir`.
- scripts/extraction/collect_code_metadata.py and scripts/preprocessing/collect_code_metadata.py: using `output_dir`.
- scripts/preprocessing/filter_codes.py and scripts/preprocessing/filter_outliers.py: using `metadata_input_dir`.

Scripts executed for the analysis:

Script:
#!/bin/bash
# Description: Verify that output paths are consistently handled.
rg --type python --glob '*.py' '/ "code_metadata.parquet"'
Length of output: 91

Script:
#!/bin/bash
# Description: Verify that output paths are consistently handled.
rg --glob '*.py' 'code_metadata.parquet'
Length of output: 703

Script:
#!/bin/bash
# Extract relevant lines from the files identified by the previous search
rg 'code_metadata.parquet' verify_parquet.py tests/test_extraction.py scripts/extraction/collect_code_metadata.py scripts/preprocessing/filter_codes.py scripts/preprocessing/collect_code_metadata.py scripts/preprocessing/filter_outliers.py
Length of output: 902
76-76: Check the configuration parameters passed to `reducer_fntr` for consistency.

Verification successful. The parameters passed to `reducer_fntr` in both scripts/extraction/collect_code_metadata.py and scripts/preprocessing/collect_code_metadata.py are consistent with its definition. Specifically, both files pass `cfg.stage_cfg` and `cfg.get("code_modifier_columns", None)`, which align with the parameters `stage_cfg` and `code_modifier_columns` expected by `reducer_fntr`.

- scripts/preprocessing/collect_code_metadata.py: reducer_fn = reducer_fntr(cfg.stage_cfg, cfg.get("code_modifier_columns", None))
- scripts/extraction/collect_code_metadata.py: reducer_fn = reducer_fntr(cfg.stage_cfg, cfg.get("code_modifier_columns", None))

Scripts executed for the analysis:

Script:
#!/bin/bash
# Description: Ensure that `reducer_fntr` is consistently used across the project.
rg --type python --glob '*.py' 'reducer_fntr'
Length of output: 76

Script:
#!/bin/bash
# Description: Ensure that `reducer_fntr` is consistently used across the project.
rg -t py 'reducer_fntr'
Length of output: 1032

Script:
#!/bin/bash
# Description: Retrieve the definition of `reducer_fntr` and its usages for detailed analysis.
ast-grep --lang python --pattern $'def reducer_fntr($_, $_):\n $$$'
Length of output: 70

Script:
#!/bin/bash
# Description: Retrieve the definition of `reducer_fntr` and its usages for detailed analysis.
ast-grep --lang python --pattern 'def reducer_fntr'
Length of output: 398

Script:
#!/bin/bash
# Description: Search for the definition of `reducer_fntr` in the specific file.
rg 'def reducer_fntr' src/MEDS_polars_functions/code_metadata.py
Length of output: 84

Script:
#!/bin/bash
# Description: Extract the full definition of `reducer_fntr` by capturing multiple lines from its definition.
rg -A 10 'def reducer_fntr' src/MEDS_polars_functions/code_metadata.py
Length of output: 800

Script:
#!/bin/bash
# Description: Verify the parameters passed to `reducer_fntr` in the specified files.
rg 'reducer_fntr' scripts/extraction/collect_code_metadata.py scripts/preprocessing/collect_code_metadata.py
Length of output: 608
scripts/extraction/collect_code_metadata.py (6)
80-80: Confirm that the output file path and name are correctly handled.

Verification successful. The output file path and name for `code_metadata.parquet` appear to be correctly handled in multiple instances across the codebase, including the file scripts/extraction/collect_code_metadata.py.

- scripts/extraction/collect_code_metadata.py: write_lazyframe(reduced, output_dir / "code_metadata.parquet")
- scripts/preprocessing/collect_code_metadata.py: write_lazyframe(reduced, output_dir / "code_metadata.parquet")

Scripts executed for the analysis:

Script:
#!/bin/bash
# Description: Verify the output file path and name are correct.
rg --type python --glob '*.py' 'code_metadata.parquet'
Length of output: 85

Script:
#!/bin/bash
# Description: Verify the output file path and name are correct.
rg 'code_metadata.parquet' --glob '*.py'
Length of output: 703
19-19: Ensure the correct configuration file is referenced.

Verification successful. The `@hydra.main` decorator in scripts/extraction/collect_code_metadata.py correctly references the `../../configs` path for the `extraction` configuration, consistent with other scripts in the extraction folder. No issues found.

Scripts executed for the analysis:

Script:
#!/bin/bash
# Description: Verify the configuration path is correctly updated across the project.
rg --type python --glob '*.py' '@hydra.main'
Length of output: 75

Script:
#!/bin/bash
# Description: Verify the configuration path is correctly updated across the project.
rg --glob '*.py' '@hydra.main'
Length of output: 1539
40-40: Verify the correct configuration parameters are passed to `mapper_fntr`.

Verification successful. The `mapper_fntr` function is consistently called with `stage_cfg` and either `None` or `code_modifier_columns` across different scripts. This indicates that the configuration parameters passed to `mapper_fntr` in scripts/extraction/collect_code_metadata.py are correct and consistent with other usages.

- src/MEDS_polars_functions/code_metadata.py:
  - mapper_fntr(stage_cfg, None)
  - mapper_fntr(stage_cfg, ListConfig(code_modifier_columns))
  - mapper_fntr(stage_cfg, code_modifier_columns)
- scripts/preprocessing/collect_code_metadata.py:
  - mapper_fntr(cfg.stage_cfg, cfg.get("code_modifier_columns", None))
- scripts/extraction/collect_code_metadata.py:
  - mapper_fntr(cfg.stage_cfg, cfg.get("code_modifier_columns", None))

Scripts executed for the analysis:

Script:
#!/bin/bash
# Description: Check if `mapper_fntr` is used consistently with the same parameters across different scripts.
rg --type python --glob '*.py' 'mapper_fntr'
Length of output: 75

Script:
#!/bin/bash
# Description: Check if `mapper_fntr` is used consistently with the same parameters across different scripts.
rg --type py 'mapper_fntr'
Length of output: 1330
48-48: Ensure output paths are handled consistently and correctly.

31-33: Check for consistent handling of paths and directory structures. Also applies to: 35-35.

Verification successful. The search results show consistent usage of `Path` from the `pathlib` module across various scripts. The pattern of converting configuration paths into `Path` objects appears to be uniformly applied. Here are some examples:

- tests/test_extraction.py
- src/MEDS_polars_functions/utils.py
- scripts/preprocessing/filter_patients.py
- scripts/preprocessing/filter_outliers.py
- scripts/preprocessing/filter_codes.py
- scripts/preprocessing/collect_code_metadata.py
- scripts/extraction/collect_code_metadata.py
- scripts/extraction/split_and_shard_patients.py
- scripts/extraction/shard_events.py
- scripts/extraction/merge_to_MEDS_cohort.py
- scripts/extraction/convert_to_sharded_events.py
- eICU_Example/pre_MEDS.py
- MIMIC-IV_Example/pre_MEDS.py

Given the extensive and consistent usage of `Path` across different modules and scripts, it appears that the handling of paths and directory structures is consistent.

Scripts executed for the analysis:

Script:
#!/bin/bash
# Description: Ensure consistency in path handling.
rg --type python --glob '*.py' 'Path'
Length of output: 68

Script:
#!/bin/bash
# Description: Ensure consistency in path handling.
rg --type py --glob '*.py' 'Path'
Length of output: 7192
77-77: Check the configuration parameters passed to `reducer_fntr` for consistency.

Verification successful. The `reducer_fntr` function in src/MEDS_polars_functions/code_metadata.py expects two parameters: `stage_cfg` and `code_modifier_columns`. Both instances where `reducer_fntr` is called in scripts/preprocessing/collect_code_metadata.py and scripts/extraction/collect_code_metadata.py pass these parameters as expected.

- scripts/preprocessing/collect_code_metadata.py: reducer_fn = reducer_fntr(cfg.stage_cfg, cfg.get("code_modifier_columns", None))
- scripts/extraction/collect_code_metadata.py: reducer_fn = reducer_fntr(cfg.stage_cfg, cfg.get("code_modifier_columns", None))

Both calls are consistent with the definition of `reducer_fntr`.

Scripts executed for the analysis:

Script:
#!/bin/bash
# Description: Ensure that `reducer_fntr` is consistently used across the project.
rg --type python --glob '*.py' 'reducer_fntr'
Length of output: 76

Script:
#!/bin/bash
# Description: Ensure that `reducer_fntr` is consistently used across the project.
rg --type py --glob '*.py' 'reducer_fntr'
Length of output: 1050

Script:
#!/bin/bash
# Description: Extract the definition of `reducer_fntr`.
ast-grep --lang python --pattern $'def reducer_fntr($_, $_):\n $$$' --json
# Extract the code snippets where `reducer_fntr` is called.
rg --type py --glob '*.py' 'reducer_fntr('
Length of output: 202

Script:
#!/bin/bash
# Description: Extract the definition of `reducer_fntr`.
ast-grep --lang python --pattern $'def reducer_fntr($_, $_):\n $$$'
# Extract the code snippets where `reducer_fntr` is called.
rg --type py --glob '*.py' 'reducer_fntr\('
Length of output: 886

Script:
#!/bin/bash
# Display the content of `src/MEDS_polars_functions/code_metadata.py` to manually review the `reducer_fntr` definition.
cat src/MEDS_polars_functions/code_metadata.py
Length of output: 30089
configs/extraction.yaml (2)
19-20: Confirm that the new stage `collect_code_metadata` and its settings are appropriately configured. Also applies to: 27-27.

Verification successful. The `collect_code_metadata` stage is properly defined with a description and specific aggregations. This configuration appears consistent with other stages in the YAML file.

- configs/extraction.yaml:
  - Lines 19-20: code_modifiers: null
  - Line 27: `collect_code_metadata` stage configuration

Scripts executed for the analysis:

Script:
#!/bin/bash
# Description: Ensure that the new stage settings are consistent with other stages.
rg --type yaml --glob '*.yaml' 'collect_code_metadata'
Length of output: 272

Script:
#!/bin/bash
# Extract the full configuration of the `collect_code_metadata` stage from the YAML file.
rg -A 10 'collect_code_metadata:' configs/extraction.yaml
Length of output: 619
74-89: Check the configuration settings for the `collect_code_metadata` stage for correctness and completeness.

Verification successful. The `collect_code_metadata` stage in the configs/extraction.yaml file seems to be well-defined and complete, with necessary fields such as `description`, `aggregations`, `do_summarize_over_all_codes`, `is_metadata`, `mapper_output_dir`, and `output_dir`.

- `description` provides a detailed explanation of the stage.
- `aggregations` lists the specific aggregations to compute.
- `do_summarize_over_all_codes` is set to true, indicating that overall, code-independent counts should be included.
- `is_metadata` is set to true.
- `mapper_output_dir` and `output_dir` specify the directories for output.

No issues were found with the configuration.

Scripts executed for the analysis:

Script:
#!/bin/bash
# Description: Verify the settings for `collect_code_metadata` are complete and correct.
rg --type yaml --glob '*.yaml' 'collect_code_metadata'
Length of output: 272

Script:
#!/bin/bash
# Description: List all fields under the `collect_code_metadata` stage in the `configs/extraction.yaml` file for verification.
awk '/collect_code_metadata:/,/^[^ ]/' configs/extraction.yaml
Length of output: 864
tests/test_extraction.py (1)
441-453: Ensure the output file existence is verified properly. The code correctly checks for the existence of the output file 'code_metadata.parquet'. However, it would be prudent to also verify the contents of this file to ensure that the data meets expected conditions. Consider adding a check that reads the file and validates some expected metadata properties, as sketched below.
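A hedged sketch of such a content check, assuming the aggregation columns configured in this PR (`code/n_patients`, `code/n_occurrences`); the exact expected values would come from the test's input data.

```python
from pathlib import Path

import polars as pl


def check_code_metadata(meds_dir: Path) -> None:
    df = pl.read_parquet(meds_dir / "code_metadata.parquet")
    # The code column and the configured aggregations should all be present...
    assert "code" in df.columns
    for col in ("code/n_patients", "code/n_occurrences"):
        assert col in df.columns, f"Expected aggregation column {col}"
        # ...and counts should be non-negative.
        assert (df[col] >= 0).all(), f"{col} should be non-negative"
```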
src/MEDS_polars_functions/code_metadata.py (4)
9-9: Ensure the added imports are used effectively throughout the codebase.

122-195: Updated function `validate_args_and_get_code_cols` to handle new configuration structures and validate them against the `METADATA_FN` enum.

- The function now properly raises exceptions for incorrect configurations, which enhances robustness.
- The use of `OmegaConf` for configuration parsing is a good practice in projects where configurations might vary significantly and need to be managed centrally.
Line range hint 196-382: The `mapper_fntr` function has been updated to handle dynamic aggregation configurations:

- The function now dynamically constructs aggregation operations based on the configuration, which adds flexibility to the data processing pipeline.
- The use of pattern matching (match-case) for handling `code_modifier_columns` is a modern Python feature that improves code readability and maintainability.
Line range hint 386-531: The `reducer_fntr` function has been enhanced to merge data from multiple frames based on the configuration:

- The function now uses a more dynamic approach to handle different aggregation types, which allows for more flexible data reduction strategies.
- The error handling for missing columns during the reduction process is robust, ensuring that the function fails gracefully with an informative error message if expected columns are missing.
…ong; shards should come from the raw input dir for pre-processing jobs.
…ands to the MIMIC example.
…ted a bug in collection of code metadata
…orch_datasets Tokenization and Tensorization
Get the preprocessing pipeline working E2E
Actionable comments posted: 8
Outside diff range and nitpick comments (12)
scripts/preprocessing/tensorize.py (1)
20-20: Please provide a detailed description of the main function's purpose or remove the TODO if it's no longer relevant.

scripts/preprocessing/filter_outliers.py (1)

19-19: Please provide a detailed description of the main function's purpose or remove the TODO if it's no longer relevant.

scripts/preprocessing/fit_vocabulary_indices.py (1)

19-19: Please provide a detailed description of the main function's purpose or remove the TODO if it's no longer relevant.

scripts/preprocessing/normalize.py (1)

18-28: Consider providing a detailed function description in the docstring. Adding a detailed docstring will improve the maintainability of the code by providing more context on what the function does and how it uses its parameters; one possible shape is sketched below.
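One possible shape for such a docstring, sketched against the Hydra entry-point pattern these scripts already use; the wording is illustrative, not the script's actual documentation.

```python
import hydra
from omegaconf import DictConfig


@hydra.main(version_base=None, config_path="../../configs", config_name="preprocess")
def main(cfg: DictConfig):
    """Normalizes the MEDS cohort into vocabulary indices and normalized values.

    Args:
        cfg: The Hydra configuration for this stage, including the input/output
            directories and the `code_modifiers` used to join against the code
            metadata produced by earlier stages.
    """
```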
scripts/preprocessing/tokenization.py (1)
20-30: Consider adding a detailed function description in the docstring. A detailed docstring would provide clarity on the function's purpose, parameters, and expected behavior, enhancing code maintainability.
src/MEDS_polars_functions/tensorize.py (1)
10-25: Enhance the function's documentation to include more detailed examples and error handling scenarios. Expanding the documentation will help users understand the function's capabilities and limitations better, especially in handling different data formats and error scenarios.
MIMIC-IV_Example/joint_script_slurm.sh (1)
[!TIP]
Codebase Verification. The review comment suggested ensuring that the slurm scripts handle failures gracefully and that parameters are correctly passed to the sub-scripts. The shell script analysis revealed the following:

- Parameter usage: parameters such as `N_PARALLEL_WORKERS`, `MIMICIV_PREMEDS_DIR`, `MIMICIV_MEDS_DIR`, and `event_conversion_config_fp` are correctly used in the sub-scripts.
- Error handling: there is minimal explicit error handling (`try`, `except`, `catch`, `trap`) in the sub-scripts. This indicates that the scripts may not handle failures as gracefully as expected.

To improve robustness, it is recommended to add explicit error handling in the sub-scripts to manage potential failures effectively.

Sub-scripts to review:
- scripts/extraction/convert_to_sharded_events.py
- scripts/extraction/merge_to_MEDS_cohort.py
- scripts/extraction/shard_events.py
- scripts/extraction/split_and_shard_patients.py

Here are the actionable points:
- Add error handling mechanisms in the sub-scripts to ensure they handle failures gracefully (see the sketch after this list).
- Review and potentially update the error handling logic in the identified sub-scripts.
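A sketch of the kind of explicit error handling this suggests; `process_shard` and `shard_fps` are hypothetical stand-ins for each sub-script's per-shard work and inputs.

```python
import logging
import sys

logger = logging.getLogger(__name__)


def run_all_shards(shard_fps: list) -> None:
    failures = []
    for fp in shard_fps:
        try:
            process_shard(fp)  # hypothetical per-shard worker
        except Exception:
            # Log the full traceback but keep processing the remaining shards.
            logger.exception("Failed to process shard %s", fp)
            failures.append(fp)
    if failures:
        logger.error("%d/%d shards failed", len(failures), len(shard_fps))
        sys.exit(1)
```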
Analysis chain

Line range hint 47-112: Ensure that the slurm scripts handle failures gracefully and that the parameters are correctly passed to the sub-scripts. Also, check that parameters like `N_PARALLEL_WORKERS` and directory paths are correctly used in the sub-scripts.

Scripts executed for the analysis:

Script:
#!/bin/bash
# Description: Verify the robustness of error handling in the slurm scripts and the correctness of parameter passing.
# Test: Simulate a failure in one of the sub-scripts and check if it's handled gracefully. Expect: Graceful handling of failures.
./scripts/extraction/test_failure_simulation.sh
Length of output: 172

Script:
#!/bin/bash
# Find the sub-scripts mentioned in the main script
fd -e py -e sh
# Check the usage of parameters and error handling in the sub-scripts
for script in $(fd -e py -e sh); do
  echo "Inspecting: $script"
  echo "Parameter usage:"
  rg 'N_PARALLEL_WORKERS|MIMICIV_PREMEDS_DIR|MIMICIV_MEDS_DIR|event_conversion_config_fp' "$script"
  echo "Error handling:"
  rg 'try|except|catch|trap' "$script"
done
Length of output: 24247
src/MEDS_polars_functions/utils.py (1)

Line range hint 114-114: Optimize dictionary key existence check.

- for s in stage_configs.keys():
+ for s in stage_configs:

MIMIC-IV_Example/README.md (2)
Line range hint 192-192: Consider replacing "lots of" with a more precise term. Consider using "many" instead of "lots of" for a more formal tone in documentation.

Line range hint 199-199: Add a possessive apostrophe to "patients".

- between the hosp table and the patients table?
+ between the hosp table and the patients' table?
README.md (1)

358-382: Consider documenting or eventually supporting the "In-memory Training" and "Fixed-batch Retrieval" methods, as these could be beneficial for specific use cases.

Tools: LanguageTool

[formatting] ~365-~365: If the 'because' clause is essential to the meaning, do not use a comma before the clause. (COMMA_BEFORE_BECAUSE) Context: "...d basis. This mode is extremely scalable, because the entire dataset never need be loaded..."

[formatting] ~368-~368: If the 'because' clause is essential to the meaning, do not use a comma before the clause. (COMMA_BEFORE_BECAUSE) Context: "...ze. This mode is also extremely flexible, because different cohorts can be loaded from th..."

src/MEDS_polars_functions/code_metadata.py (1)
Line range hint 386-530: The `reducer_fntr` function is well-structured and includes detailed error handling and input validation. It effectively uses the aggregation operations defined in `CODE_METADATA_AGGREGATIONS`. The dynamic renaming of columns to handle multiple dataframes in the reduction process is a clever use of Polars' capabilities. However, add a check to ensure `stage_cfg` is of the correct type.

+ if not isinstance(stage_cfg, DictConfig):
+     raise TypeError("Expected stage_cfg to be of type DictConfig")
Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Files selected for processing (25)
- MIMIC-IV_Example/README.md (4 hunks)
- MIMIC-IV_Example/joint_script_slurm.sh (2 hunks)
- README.md (1 hunks)
- configs/pipeline.yaml (1 hunks)
- configs/preprocess.yaml (1 hunks)
- preprocessing_operation_prototypes.md (1 hunks)
- pyproject.toml (1 hunks)
- scripts/extraction/collect_code_metadata.py (1 hunks)
- scripts/preprocessing/add_time_derived_measurements.py (2 hunks)
- scripts/preprocessing/collect_code_metadata.py (5 hunks)
- scripts/preprocessing/filter_codes.py (1 hunks)
- scripts/preprocessing/filter_outliers.py (1 hunks)
- scripts/preprocessing/filter_patients.py (2 hunks)
- scripts/preprocessing/fit_vocabulary_indices.py (1 hunks)
- scripts/preprocessing/normalize.py (1 hunks)
- scripts/preprocessing/tensorize.py (1 hunks)
- scripts/preprocessing/tokenization.py (1 hunks)
- src/MEDS_polars_functions/code_metadata.py (9 hunks)
- src/MEDS_polars_functions/filter_measurements.py (1 hunks)
- src/MEDS_polars_functions/get_vocabulary.py (1 hunks)
- src/MEDS_polars_functions/normalization.py (5 hunks)
- src/MEDS_polars_functions/tensorize.py (1 hunks)
- src/MEDS_polars_functions/tokenize.py (1 hunks)
- src/MEDS_polars_functions/utils.py (1 hunks)
- terminology.md (1 hunks)
Files skipped from review due to trivial changes (2)
- configs/pipeline.yaml
- pyproject.toml
Files skipped from review as they are similar to previous changes (6)
- configs/preprocess.yaml
- scripts/extraction/collect_code_metadata.py
- scripts/preprocessing/add_time_derived_measurements.py
- scripts/preprocessing/collect_code_metadata.py
- scripts/preprocessing/filter_codes.py
- scripts/preprocessing/filter_patients.py
Additional context used
Markdownlint
terminology.md
3-3: Expected: h2; Actual: h4 (MD001, heading-increment). Heading levels should only increment by one level at a time.

MIMIC-IV_Example/README.md
183-183: Punctuation: ':' (MD026, no-trailing-punctuation). Trailing punctuation in heading.

43-43: null (MD034, no-bare-urls). Bare URL used.

preprocessing_operation_prototypes.md
42-42: Expected: h4; Actual: h5 (MD001, heading-increment). Heading levels should only increment by one level at a time.

70-70: Expected: h4; Actual: h5 (MD001, heading-increment). Heading levels should only increment by one level at a time.

24-24: Expected: underscore; Actual: asterisk (MD049, emphasis-style). Emphasis style.

24-24: Expected: underscore; Actual: asterisk (MD049, emphasis-style). Emphasis style.

README.md
254-254: Punctuation: ':' (MD026, no-trailing-punctuation). Trailing punctuation in heading.

417-417: Punctuation: ':' (MD026, no-trailing-punctuation). Trailing punctuation in heading.
Ruff
src/MEDS_polars_functions/utils.py
114-114: Use `key in dict` instead of `key in dict.keys()` (SIM118). Remove `.keys()`.
src/MEDS_polars_functions/filter_measurements.py
8-8: Module level import not at top of file (E402)
119-123: f-string used as docstring. Python will interpret this as a joined string, rather than a docstring. (B021)
207-211: f-string used as docstring. Python will interpret this as a joined string, rather than a docstring. (B021)
LanguageTool
MIMIC-IV_Example/README.md
[uncategorized] ~87-~87: Possible missing comma found. (AI_HYDRA_LEO_MISSING_COMMA) Context: "...ry we'll denote as `$MIMICIV_MEDS_DIR`. Note this is a different directory than the ..."

[style] ~192-~192: The phrase 'lots of' might be wordy and overused. Consider using an alternative. (A_LOT_OF) Context: "...etimeevents 5. `icu/ingredientevents` Lots of questions remain about how to appropria..."

[uncategorized] ~199-~199: It seems likely that a singular genitive ('s) apostrophe is missing. (AI_HYDRA_LEO_APOSTROPHE_S_XS) Context: "...athtimes between the hosp table and the patients table? 2. How to handle the dob nonsens..."

preprocessing_operation_prototypes.md
[style] ~7-~7: The phrase "a variety of" may be wordy. To make your writing clearer, consider replacing it. (A_VARIETY_OF) Context: "...that can be applied to MEDS datasets in a variety of circumstances to accomplish diverse, ye..."

[style] ~32-~32: Consider replacing this phrase with the adverb "consistently" to avoid wordiness. (IN_A_X_MANNER) Context: "...ts being merged into the output dataset in a consistent manner. Currently, these capabilities are only..."

[uncategorized] ~45-~45: Possible missing comma found. (AI_HYDRA_LEO_MISSING_COMMA) Context: "...olumns should not be modified in this step as that will break the linkage with ..."

[uncategorized] ~56-~56: A comma may be missing after the conjunctive/linking adverb 'Currently'. (SENT_START_CONJUNCTIVE_LINKING_ADVERB_COMMA) Context: "...paradigm is not yet implemented. ##### Currently Supported Operations Functions: 1. As..."

[style] ~80-~80: 'prior to' might be wordy. Consider a shorter alternative. (EN_WORDINESS_PREMIUM_PRIOR_TO) Context: "... patient data filters should be applied prior to aggregation. 2. What aggregation functi..."

[style] ~84-~84: Consider shortening this phrase to just 'whether', unless you mean 'regardless of whether'. (WHETHER) Context: "...nto a single, unified metadata file. 3. Whether or not aggregation functions should be compute..."

[style] ~89-~89: 'prior to' might be wordy. Consider a shorter alternative. (EN_WORDINESS_PREMIUM_PRIOR_TO) Context: "... lacks support for patient-data filters prior to aggregation. ##### Currently supported..."

[uncategorized] ~91-~91: A comma may be missing after the conjunctive/linking adverb 'Currently'. (SENT_START_CONJUNCTIVE_LINKING_ADVERB_COMMA) Context: "...ta filters prior to aggregation. ##### Currently supported operations Patient Filters: ..."

[style] ~120-~120: 'on the basis of' might be wordy. Consider a shorter alternative. (EN_WORDINESS_PREMIUM_ON_THE_BASIS_OF) Context: "...move all data corresponding to patients on the basis of the resulting criteria. 3. Return the f..."

[uncategorized] ~136-~136: A comma may be missing after the conjunctive/linking adverb 'Currently'. (SENT_START_CONJUNCTIVE_LINKING_ADVERB_COMMA) Context: "...s/filter_patients_by_length.py`. ##### Currently supported operations 1. Filtering pati..."

[uncategorized] ~168-~168: A comma may be missing after the conjunctive/linking adverb 'Currently'. (SENT_START_CONJUNCTIVE_LINKING_ADVERB_COMMA) Context: "...h extension relatively natively. ##### Currently supported operations Currently, measur..."

[style] ~170-~170: 'on the basis of' might be wordy. Consider a shorter alternative. (EN_WORDINESS_PREMIUM_ON_THE_BASIS_OF) Context: "...Currently, measurements can be filtered on the basis of `min_patients_per_code` and `min_occurr..."

[uncategorized] ~197-~197: Use a comma before "and" if it connects two independent clauses (unless they are closely connected and short). (COMMA_COMPOUND_SENTENCE_2) Context: "...ion that the indicator columns are added and this function is not reversible. #####..."

[uncategorized] ~219-~219: A comma may be missing after the conjunctive/linking adverb 'Currently'. (SENT_START_CONJUNCTIVE_LINKING_ADVERB_COMMA) Context: "... is not yet a general prototype. ##### Currently supported operations 1. Occluding nume..."

README.md
[grammar] ~9-~9: "Google" is a proper noun and needs to be capitalized. (A_GOOGLE) Context: "...e the "Roadmap" section below and [this google doc](https://docs.google.com/document/d..."

[grammar] ~98-~98: The modal verb 'will' requires the verb's base form. (MD_BASEFORM) Context: "...worker jobs. Note this will typically required a distributed file system to work corre..."

[uncategorized] ~102-~102: Possible missing comma found. (AI_HYDRA_LEO_MISSING_COMMA) Context: "...ficant speedup depending on the machine configuration as sharding ensures that parallelism ca..."

[style] ~109-~109: As an alternative to the over-used intensifier 'extremely', consider replacing this phrase. (EN_WEAK_ADJECTIVE) Context: "...idual stage of the pipeline can be kept extremely simple (often each stage corresponds to a sing..."

[style] ~112-~112: The phrase "a variety of" may be wordy. To make your writing clearer, consider replacing it. (A_VARIETY_OF) Context: "...parallelization) use of the pipeline in a variety of environments, all without imposing sign..."

[style] ~157-~157: You have already used this phrasing in nearby sentences. Consider replacing it to add variety to your writing. (REP_NEED_TO_VB) Context: "...s and external ontologies? Will they need to specify select key-event concepts to id..."

[uncategorized] ~183-~183: Possible missing comma found. (AI_HYDRA_LEO_MISSING_COMMA) Context: "...achine or across many worker nodes on a cluster provided the worker nodes have acces..."

[typographical] ~211-~211: Consider adding a comma here. (FOR_NOW_COMMA) Context: "...the other valid nested MEDS format, but for now we leave that up to the user. #### Inp..."

[style] ~275-~275: The phrase "a variety of" may be wordy. To make your writing clearer, consider replacing it. (A_VARIETY_OF) Context: "...r downstream use. This package contains a variety of pre-processing transformations and scri..."

[style] ~282-~282: This phrase is redundant. Consider using "outside". (OUTSIDE_OF) Context: "... - Removing numerical values that fall outside of pre-specified, per-code ranges (e.g., f..."

[style] ~291-~291: Consider using "who" when you are referring to a person instead of an object. (THAT_WHO) Context: "..." event to the dataset for each patient that occurs at the end of the observation ..."

[style] ~300-~300: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym. (ENGLISH_WORD_REPEAT_BEGINNING_RULE) Context: "...s each code occurs in the dataset. - Computing appropriate numerical bins for each cod..."

[uncategorized] ~304-~304: Possible missing preposition found. (AI_HYDRA_LEO_MISSING_TO) Context: "...sure that the only columns that need be processed going into the pre-processing..."

[formatting] ~365-~365: If the 'because' clause is essential to the meaning, do not use a comma before the clause. (COMMA_BEFORE_BECAUSE) Context: "...d basis. This mode is extremely scalable, because the entire dataset never need be loaded..."

[formatting] ~368-~368: If the 'because' clause is essential to the meaning, do not use a comma before the clause. (COMMA_BEFORE_BECAUSE) Context: "...ze. This mode is also extremely flexible, because different cohorts can be loaded from th..."

[grammar] ~417-~417: It appears that a hyphen is missing in the plural noun "to-dos"? (TO_DO_HYPHEN) Context: "...bmitit cluster parallelism support. ## TODOs: 1. We need to have a vehicle to clean..."

[grammar] ~422-~422: If 'kind' is a classification term, 'a' is not necessary. Use "kind of". (The phrases 'kind of' and 'sort of' are informal if they mean 'to some extent'.) (KIND_OF_A) Context: "... right prior files. Likely need some kind of a "prior stage name" config variable."
Additional comments not posted (26)
scripts/preprocessing/normalize.py (1)
39-58: Validate the use of `pl.read_parquet` with `use_pyarrow=True`. Also, ensure that the `rwlock_wrap` function handles `pl.scan_parquet` and `write_lazyframe` correctly, especially with the `do_overwrite` configuration.

scripts/preprocessing/tokenization.py (1)

40-69: Ensure consistent handling of file paths and error handling in the tokenization process. Also, verify that `rwlock_wrap` correctly handles the `pl.scan_parquet` and `write_lazyframe` functions, especially considering the `do_overwrite` configuration.

src/MEDS_polars_functions/normalization.py (1)
7-7: Ensure the `code_modifiers` parameter is used correctly to filter the MEDS dataset.

Verification successful. The `code_modifiers` parameter is indeed used correctly in the `normalize` function to filter the MEDS dataset. The parameter is checked for a `None` value, initialized if necessary, and then utilized in the join operation and column selection.

- Initialization: if `code_modifiers` is `None`, it is set to an empty list.
- Column selection: `code_modifiers` is added to the list of columns to select from `code_metadata`.
- Join operation: the join operation uses `code_modifiers` to join the `df` and `code_metadata` dataframes.

Additionally, the `code_modifiers` parameter is referenced in other related functions and scripts, ensuring its consistent usage across the codebase.

src/MEDS_polars_functions/normalization.py
- Line 7: parameter definition
- Lines 41-42: handling the `None` value
- Line 44: column selection
- Line 56: join operation

src/MEDS_polars_functions/get_vocabulary.py
- Various functions: `validate_code_metadata`, `lexicographic_indices`

scripts/preprocessing/normalize.py
- Line 14: configuration retrieval
- Line 15: partial function application

scripts/preprocessing/fit_vocabulary_indices.py
- Lines 14-17: configuration retrieval and initialization
- Line 19: function application

Scripts executed for the analysis:

Script:
#!/bin/bash
# Verify the usage of the `code_modifiers` parameter in the normalization function.
rg --type python 'code_modifiers' --context 5
Length of output: 76

Script:
#!/bin/bash
# Verify the usage of the `code_modifiers` parameter in the normalization function.
rg --type py 'code_modifiers' --context 5
Length of output: 16822
src/MEDS_polars_functions/utils.py (1)

153-155: Add type checking for `is_metadata` to prevent runtime errors. Proper type validation ensures robustness and prevents unexpected behavior during runtime; a minimal sketch follows.
src/MEDS_polars_functions/tokenize.py (2)
90-92: Ensure proper separation of static and dynamic data based on timestamp. The logic correctly separates data into static and dynamic parts based on the presence of a timestamp.
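Roughly, the separation described here can be expressed in Polars as below; column names follow the MEDS schema used in this PR, and the real implementation in tokenize.py may differ in detail.

```python
import polars as pl


def split_static_dynamic(df: pl.LazyFrame) -> tuple[pl.LazyFrame, pl.LazyFrame]:
    # Static measurements carry no timestamp; dynamic ones do.
    static = df.filter(pl.col("timestamp").is_null())
    dynamic = df.filter(pl.col("timestamp").is_not_null())
    return static, dynamic
```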
139-151: Verify the correct joining of static and dynamic dataframes.

Verification successful. The full implementation of the `extract_statics_and_schema` function has been successfully retrieved. The function joins the `static_by_patient` and `schema_by_patient` dataframes on the `patient_id` column using an inner join.

- The `static_by_patient` dataframe is grouped by `patient_id` and aggregates the `code` and `numerical_value` columns.
- The `schema_by_patient` dataframe is grouped by `patient_id`, aggregates the minimum `timestamp` as `start_time`, and collects unique `timestamp` values.

The join operation appears to be correctly implemented, joining on the `patient_id` column with an inner join; a sketch of this shape follows.
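A hedged reconstruction of that shape, assuming a recent Polars API; the authoritative version is the retrieved definition in src/MEDS_polars_functions/tokenize.py.

```python
import polars as pl


def statics_and_schema(df: pl.LazyFrame) -> pl.LazyFrame:
    static_by_patient = (
        df.filter(pl.col("timestamp").is_null())
        .group_by("patient_id", maintain_order=True)
        .agg(pl.col("code"), pl.col("numerical_value"))
    )
    schema_by_patient = (
        df.filter(pl.col("timestamp").is_not_null())
        .group_by("patient_id", maintain_order=True)
        .agg(
            pl.col("timestamp").min().alias("start_time"),
            pl.col("timestamp").unique(),
        )
    )
    # The inner join keeps only patients that have both static and dynamic data.
    return static_by_patient.join(schema_by_patient, on="patient_id", how="inner")
```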
Scripts executed for the analysis:

Script:
#!/bin/bash
# Verify the correct joining of static and dynamic dataframes in `extract_statics_and_schema`.
rg --type python 'join' --context 5 src/MEDS_polars_functions/tokenize.py
Length of output: 104

Script:
#!/bin/bash
# Locate the function `extract_statics_and_schema` and extract its contents.
ast-grep --lang python --pattern $'def extract_statics_and_schema($_):\n $$$' src/MEDS_polars_functions/tokenize.py
Length of output: 118

Script:
#!/bin/bash
# Locate the function `extract_statics_and_schema` in the file `src/MEDS_polars_functions/tokenize.py`.
rg 'def extract_statics_and_schema' -A 20 src/MEDS_polars_functions/tokenize.py
Length of output: 1061

Script:
#!/bin/bash
# Locate the complete function `extract_statics_and_schema` in the file `src/MEDS_polars_functions/tokenize.py`.
rg 'def extract_statics_and_schema' -A 50 src/MEDS_polars_functions/tokenize.py
Length of output: 2628

Script:
#!/bin/bash
# Locate the complete function `extract_statics_and_schema` in the file `src/MEDS_polars_functions/tokenize.py`.
rg 'def extract_statics_and_schema' -A 100 src/MEDS_polars_functions/tokenize.py
Length of output: 5110
MIMIC-IV_Example/README.md (1)

85-85: Clarify the expected output directory for the MEDS extraction ETL.

Verification successful. The README.md does provide information about the expected output directory for the MEDS extraction ETL. It mentions that the final MEDS dataset should be output into a directory denoted as `$MIMICIV_MEDS_DIR`.

- Line 85: "We will assume you want to output the final MEDS dataset into a directory we'll denote as `$MIMICIV_MEDS_DIR`."

This clarification is present in the documentation, making the review comment unnecessary.

Scripts executed for the analysis:

Script:
#!/bin/bash
# Description: Verify the documentation consistency for the output directory in MEDS extraction ETL.
rg --type markdown 'output_dir' MIMIC-IV_Example/README.md
Length of output: 153

Script:
#!/bin/bash
# Description: Verify mentions of the MEDS extraction ETL in the README.md to check for references to the output directory.
rg -A 5 'MEDS extraction ETL' MIMIC-IV_Example/README.md
Length of output: 940
src/MEDS_polars_functions/filter_measurements.py (3)
8-8: Move the import statement to the top of the file to adhere to PEP 8 guidelines.

Tools: Ruff
8-8: Module level import not at top of file (E402)

119-123: Replace f-string docstrings with regular docstrings to prevent Python from interpreting them as joined strings.

Tools: Ruff
119-123: f-string used as docstring. Python will interpret this as a joined string, rather than a docstring. (B021)
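One way to resolve B021 is to keep a static docstring and move the interpolated details into a log call; a sketch under that assumption, with illustrative parameter names:

```python
import logging

logger = logging.getLogger(__name__)


def filter_codes_fn(df, min_patients_per_code=None, min_occurrences_per_code=None):
    """Filters patient events to only include (code, modifier) pairs passing the
    configured minimum patient and occurrence counts."""
    # Interpolated thresholds belong in a log line, not the docstring:
    logger.debug(
        "Filtering with min_patients_per_code=%s, min_occurrences_per_code=%s",
        min_patients_per_code,
        min_occurrences_per_code,
    )
    # ... actual filtering logic unchanged ...
    return df
```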
207-211: Replace f-string docstrings with regular docstrings to prevent Python from interpreting them as joined strings.

Tools: Ruff
207-211: f-string used as docstring. Python will interpret this as a joined string, rather than a docstring. (B021)
src/MEDS_polars_functions/get_vocabulary.py (3)
9-22: Introduced an enumeration `VOCABULARY_ORDERING` for defining vocabulary ordering methods. This is a clear and extensible way to manage different ordering strategies.

28-92: Added the `validate_code_metadata` function to ensure the integrity of code metadata. This function checks for the presence of required columns and the uniqueness of the code and modifiers, which is crucial for maintaining data consistency; a sketch of this pattern follows.
95-180: Introduced `lexicographic_indices` to assign vocabulary indices based on lexicographic order. This function is well-documented and includes thorough error handling and examples, which enhances maintainability and usability.

preprocessing_operation_prototypes.md (9)
1-1: The document header clearly states that the content is still in progress, setting the correct expectations for the reader.

6-12: The introduction provides a good overview of the purpose of operation prototypes. This helps in understanding the context and utility of the described operations.

Tools: LanguageTool [style] ~7-~7 (A_VARIETY_OF), as listed above.

14-16: Use of the term "code" is well-explained, which is important for clarity and avoiding confusion given its potential overlap with programming terminology. [APPROVED]

18-25: The distinction between "removing data" and "occluding data" is clearly made, which is crucial for understanding the implications of different operations on data integrity and traceability.

Tools: Markdownlint 24-24 (twice): Expected: underscore; Actual: asterisk (MD049, emphasis-style).

27-34: Describes the flexibility of filtering prototypes and their ability to be applied conditionally based on various criteria. However, it notes that these capabilities are not yet implemented, which is good for setting realistic expectations.

Tools: LanguageTool [style] ~32-~32 (IN_A_X_MANNER), as listed above.

37-41: Explains the "Transform Codes" operation, focusing on static remapping of codes. The warning against modifying `code` or `code_modifier` columns is crucial to maintain linkage with patient data.

64-77: Describes the "Collect metadata about code realizations in patient data" operation. It provides a clear step-by-step breakdown of the operation, enhancing understandability.

Tools: Markdownlint 70-70: Expected: h4; Actual: h5 (MD001, heading-increment).

106-122: Details different modes of filtering data from MEDS datasets. The step-by-step description of each mode helps in understanding the operational flow and potential applications.

Tools: LanguageTool [style] ~120-~120 (EN_WORDINESS_PREMIUM_ON_THE_BASIS_OF), as listed above.

180-234: Mentions planned operations for transforming measurements within events. This forward-looking statement helps in understanding the future scope of the project.

Tools: LanguageTool [uncategorized] ~197-~197 (COMMA_COMPOUND_SENTENCE_2) and ~219-~219 (SENT_START_CONJUNCTIVE_LINKING_ADVERB_COMMA), as listed above.

README.md (2)
332-338: The description of the tokenization process is clear and well-articulated.

342-382: The tensorization section provides a comprehensive overview of different training modes, emphasizing the benefits of Direct Retrieval, which is supported by this repository.

Tools: LanguageTool [formatting] ~365-~365 and ~368-~368 (COMMA_BEFORE_BECAUSE), as listed above.

src/MEDS_polars_functions/code_metadata.py (2)
9-9: The added imports from `omegaconf` are appropriate for the configuration management required in this module.

Line range hint 196-382: The `mapper_fntr` function is correctly implemented with comprehensive input validation and error handling, leveraging the previously defined `validate_args_and_get_code_cols`. The use of Python's pattern matching for `code_modifier_columns` is modern and aids in clarity.
Actionable comments posted: 1
Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Files selected for processing (1)
- tests/test_extraction.py (3 hunks)
Additional context used
Ruff
tests/test_extraction.py
7-7: Module level import not at top of file (E402)
8-8: Module level import not at top of file (E402)
9-9: Module level import not at top of file (E402)
10-10: Module level import not at top of file (E402)
11-11: Module level import not at top of file (E402)
13-13: Module level import not at top of file (E402)
14-14: Module level import not at top of file (E402)
15-15: Module level import not at top of file (E402)
386-386: Use `key in dict` instead of `key in dict.keys()` (SIM118). Remove `.keys()`.
Additional comments not posted (3)
tests/test_extraction.py (3)
150-204: LGTM! The test data for various splits is well-defined and mirrors the expected structure of the MEDS outputs.

212-221: The updated `SUB_SHARDED_OUTPUTS` dictionary correctly maps new test data structures to their respective splits. This change ensures that the test suite can accurately validate the data processing across different stages.

228-230: The `MEDS_OUTPUTS` dictionary has been updated to reference the newly added data structures. This structure is crucial for ensuring that the output of the test cases can be validated against expected results.
Actionable comments posted: 1
Outside diff range and nitpick comments (2)
tests/test_extraction.py (2)
Line range hint 7-14: Consider moving module-level imports to the top of the file.

+ import json
+ import subprocess
+ import tempfile
+ from io import StringIO
+ from pathlib import Path
+ import polars as pl
+ from polars.testing import assert_frame_equal
+ pl.enable_string_cache()

Tools: Ruff
11-11, 13-13, 14-14: Module level import not at top of file (E402)
Line range hint 401-401: Simplify the dictionary key check by removing `.keys()`.

- if key in dict.keys()
+ if key in dict

Tools: Ruff
11-11, 13-13, 14-14: Module level import not at top of file (E402)
Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Files selected for processing (1)
- tests/test_extraction.py (5 hunks)
Additional context used
Ruff
tests/test_extraction.py
7-7: Module level import not at top of file (E402)
8-8: Module level import not at top of file (E402)
9-9: Module level import not at top of file (E402)
10-10: Module level import not at top of file (E402)
11-11: Module level import not at top of file (E402)
13-13: Module level import not at top of file (E402)
14-14: Module level import not at top of file (E402)
401-401: Use `key in dict` instead of `key in dict.keys()` (SIM118). Remove `.keys()`.
Additional comments not posted (1)
tests/test_extraction.py (1)
502-534
: Theassert_df_equal
function is well-implemented with detailed output for debugging.
Welcome to Codecov 🎉 Once you merge this PR into your default branch, you're all set! Codecov will compare coverage reports and display results in all future pull requests. Thanks for integrating Codecov. We've got you covered ☂️
Actionable comments posted: 3
Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Files selected for processing (2)
- .github/workflows/tests.yaml (1 hunks)
- src/MEDS_polars_functions/filter_measurements.py (1 hunks)
Files skipped from review due to trivial changes (1)
- .github/workflows/tests.yaml
Additional context used
Ruff
src/MEDS_polars_functions/filter_measurements.py
119-123: f-string used as docstring. Python will interpret this as a joined string, rather than a docstring. (B021)
207-211: f-string used as docstring. Python will interpret this as a joined string, rather than a docstring. (B021)
Additional comments not posted (1)
src/MEDS_polars_functions/filter_measurements.py (1)
139-227: Review of the `filter_outliers_fntr` function. The `filter_outliers_fntr` function introduces a robust method for filtering out outliers based on standard deviations. This is a crucial step in data preprocessing to ensure the quality of data before further analysis. The function is implemented efficiently using Polars' lazy frames, which is suitable for handling large datasets. However, similar to the previous function, adding more inline comments could improve understanding and maintainability. A simplified sketch of the technique follows.

Tools: Ruff
207-211: f-string used as docstring. Python will interpret this as a joined string, rather than a docstring. (B021)
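A simplified sketch of stddev-based outlier filtering over Polars lazy frames. The per-code statistics columns (`values/mean`, `values/std`) and the `stddev_cutoff` name are assumptions; the real function also handles code modifier columns.

```python
import polars as pl


def filter_outliers(
    df: pl.LazyFrame, code_stats: pl.LazyFrame, stddev_cutoff: float
) -> pl.LazyFrame:
    return (
        df.join(code_stats, on="code", how="left")
        .filter(
            # Keep non-numerical rows, and numerical values within the cutoff.
            pl.col("numerical_value").is_null()
            | (
                (pl.col("numerical_value") - pl.col("values/mean")).abs()
                <= stddev_cutoff * pl.col("values/std")
            )
        )
        .drop("values/mean", "values/std")
    )
```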
>>> fn(data).collect()
shape: (2, 3)
┌────────────┬──────┬───────────┐
│ patient_id ┆ code ┆ modifier1 │
│ ---        ┆ ---  ┆ ---       │
│ i64        ┆ cat  ┆ i64       │
╞════════════╪══════╪═══════════╡
│ 1          ┆ A    ┆ 1         │
│ 1          ┆ B    ┆ 1         │
└────────────┴──────┴───────────┘
>>> stage_cfg = DictConfig({"min_patients_per_code": 1, "min_occurrences_per_code": 4})
>>> fn = filter_codes_fntr(stage_cfg, code_metadata_df, ["modifier1"])
>>> fn(data).collect()
shape: (2, 3)
┌────────────┬──────┬───────────┐
│ patient_id ┆ code ┆ modifier1 │
│ ---        ┆ ---  ┆ ---       │
│ i64        ┆ cat  ┆ i64       │
╞════════════╪══════╪═══════════╡
│ 1          ┆ A    ┆ 1         │
│ 2          ┆ A    ┆ 2         │
└────────────┴──────┴───────────┘
>>> stage_cfg = DictConfig({"min_patients_per_code": 1})
>>> fn = filter_codes_fntr(stage_cfg, code_metadata_df, ["modifier1"])
>>> fn(data).collect()
shape: (4, 3)
┌────────────┬──────┬───────────┐
│ patient_id ┆ code ┆ modifier1 │
│ ---        ┆ ---  ┆ ---       │
│ i64        ┆ cat  ┆ i64       │
╞════════════╪══════╪═══════════╡
│ 1          ┆ A    ┆ 1         │
│ 1          ┆ B    ┆ 1         │
│ 2          ┆ A    ┆ 2         │
│ 2          ┆ C    ┆ 2         │
└────────────┴──────┴───────────┘
>>> stage_cfg = DictConfig({"min_patients_per_code": None, "min_occurrences_per_code": None})
>>> fn = filter_codes_fntr(stage_cfg, code_metadata_df, ["modifier1"])
>>> fn(data).collect()
shape: (4, 3)
┌────────────┬──────┬───────────┐
│ patient_id ┆ code ┆ modifier1 │
│ ---        ┆ ---  ┆ ---       │
│ i64        ┆ cat  ┆ i64       │
╞════════════╪══════╪═══════════╡
│ 1          ┆ A    ┆ 1         │
│ 1          ┆ B    ┆ 1         │
│ 2          ┆ A    ┆ 2         │
│ 2          ┆ C    ┆ 2         │
└────────────┴──────┴───────────┘
>>> stage_cfg = DictConfig({"min_occurrences_per_code": 5})
>>> fn = filter_codes_fntr(stage_cfg, code_metadata_df, ["modifier1"])
>>> fn(data).collect()
shape: (1, 3)
┌────────────┬──────┬───────────┐
│ patient_id ┆ code ┆ modifier1 │
│ ---        ┆ ---  ┆ ---       │
│ i64        ┆ cat  ┆ i64       │
╞════════════╪══════╪═══════════╡
│ 2          ┆ A    ┆ 2         │
└────────────┴──────┴───────────┘
"""
|
||
min_patients_per_code = stage_cfg.get("min_patients_per_code", None) | ||
min_occurrences_per_code = stage_cfg.get("min_occurrences_per_code", None) | ||
|
||
filter_exprs = [] | ||
if min_patients_per_code is not None: | ||
filter_exprs.append(pl.col("code/n_patients") >= min_patients_per_code) | ||
if min_occurrences_per_code is not None: | ||
filter_exprs.append(pl.col("code/n_occurrences") >= min_occurrences_per_code) | ||
|
||
if not filter_exprs: | ||
return lambda df: df | ||
|
||
join_cols = ["code"] | ||
if code_modifier_columns: | ||
join_cols.extend(code_modifier_columns) | ||
|
||
allowed_code_metadata = (code_metadata.filter(pl.all_horizontal(filter_exprs)).select(join_cols)).lazy() | ||
|
||
def filter_codes_fn(df: pl.LazyFrame) -> pl.LazyFrame: | ||
f"""Filters patient events to only encompass those with a set of permissible codes. | ||
|
||
In particular, this function filters the DataFrame to only include (code, modifier) pairs that have | ||
at least {min_patients_per_code} patients and {min_occurrences_per_code} occurrences. | ||
""" | ||
|
||
idx_col = "_row_idx" | ||
while idx_col in df.columns: | ||
idx_col = f"_{idx_col}" | ||
|
||
return ( | ||
df.with_row_count(idx_col) | ||
.join(allowed_code_metadata, on=join_cols, how="inner") | ||
.sort(idx_col) | ||
.drop(idx_col) | ||
) | ||
|
||
return filter_codes_fn |
Refactor suggestion for the `filter_codes_fntr` function.

The `filter_codes_fntr` function is well-structured and follows good Pythonic practices. However, consider adding more detailed comments within the function to explain the logic behind each step, especially within the filtering conditions. This will enhance maintainability and readability for other developers who might work on this code later.
Tools
Ruff
119-123: f-string used as docstring. Python will interpret this as a joined string, rather than a docstring. (B021)
f"""Filters out outlier numerical values from patient events. | ||
|
||
In particular, this function filters the DataFrame to only include numerical values that are within | ||
{stddev_cutoff} standard deviations of the mean for the corresponding (code, modifier) pair. | ||
""" |
Correct f-string usage in docstrings.
As with the previous function, replace the f-string in the docstring to avoid misinterpretations by Python.
- f"""Filters out outlier numerical values from patient events.
+ """Filters out outlier numerical values from patient events.
Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
f"""Filters out outlier numerical values from patient events. | |
In particular, this function filters the DataFrame to only include numerical values that are within | |
{stddev_cutoff} standard deviations of the mean for the corresponding (code, modifier) pair. | |
""" | |
"""Filters out outlier numerical values from patient events. | |
In particular, this function filters the DataFrame to only include numerical values that are within | |
{stddev_cutoff} standard deviations of the mean for the corresponding (code, modifier) pair. | |
""" |
Tools
Ruff
207-211: f-string used as docstring. Python will interpret this as a joined string, rather than a docstring. (B021)
f"""Filters patient events to only encompass those with a set of permissible codes. | ||
|
||
In particular, this function filters the DataFrame to only include (code, modifier) pairs that have | ||
at least {min_patients_per_code} patients and {min_occurrences_per_code} occurrences. | ||
""" |
Replace f-string docstrings with regular docstrings.
The static analysis tool flagged the use of f-strings as docstrings. Python interprets these as joined strings rather than docstrings, which could lead to unexpected behaviors or documentation issues.
- f"""Filters patient events to only encompass those with a set of permissible codes.
+ """Filters patient events to only encompass those with a set of permissible codes.
Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
f"""Filters patient events to only encompass those with a set of permissible codes. | |
In particular, this function filters the DataFrame to only include (code, modifier) pairs that have | |
at least {min_patients_per_code} patients and {min_occurrences_per_code} occurrences. | |
""" | |
"""Filters patient events to only encompass those with a set of permissible codes. | |
In particular, this function filters the DataFrame to only include (code, modifier) pairs that have | |
at least {min_patients_per_code} patients and {min_occurrences_per_code} occurrences. | |
""" |
Tools
Ruff
119-123: f-string used as docstring. Python will interpret this as a joined string, rather than a docstring. (B021)
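The practical consequence of B021 is easy to demonstrate: because an f-string is an expression rather than a plain string constant, Python never attaches it to the function's `__doc__`. The snippet below is a standalone illustration, not code from this repository:

def with_plain_docstring():
    """A regular docstring: stored on __doc__."""


def with_fstring_docstring():
    f"""Looks like a docstring, but the {'interpolation'} makes it an expression."""


print(with_plain_docstring.__doc__)    # 'A regular docstring: stored on __doc__.'
print(with_fstring_docstring.__doc__)  # None -- the f-string is evaluated and discarded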
Actionable comments posted: 0
Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Files selected for processing (2)
- .github/workflows/tests.yaml (1 hunks)
- src/MEDS_polars_functions/filter_measurements.py (1 hunks)
Files not reviewed due to errors (1)
- src/MEDS_polars_functions/filter_measurements.py (no review received)
Additional context used
Ruff
src/MEDS_polars_functions/filter_measurements.py
119-123: f-string used as docstring. Python will interpret this as a joined string, rather than a docstring. (B021)
207-211: f-string used as docstring. Python will interpret this as a joined string, rather than a docstring. (B021)
Additional comments not posted (1)
.github/workflows/tests.yaml (1)
36-36: Enhanced Testing Configuration Approved

The addition of the `--doctest-modules` and `--cov=src` flags to the pytest command is excellent for improving test coverage and ensuring that doctests are run. Ensure that the coverage paths are correctly configured to reflect the actual source directories.
This replaces #12, which I will close momentarily.
Summary by CodeRabbit
New Features
Refactor
Bug Fixes